Bad weather Kelheim Demo

Oleksandr Soboliev

Data

Data is taken from these resources:

Regression analysis resources

Analysis was proceeded using this statistical sources:

Importing and preparing data

Main target is to get result table containing all “expected” relevant data/variables that can be connected with changes of demand in Kelheim region, thus mobility data is collected on daily basis, remaining data has to be daily based.

After obtaining such table it would be useful(in sense of building a model and understanding the data) to check correlations between individual independent variable to dependent mobility variable. So the next chapter has a focus on filtering out unrelevant data to pass it to the model.

##Modify and join data

# Weatherstack
weatherstack_kelheim_daily = weatherstack_kelheim %>%
  group_by(date) %>%
  count(description)

# Stringency 
deu_stringency = json[grep("DEU.stringency_actual",names(json))]
date_stringency = sapply(strsplit(names(deu_stringency),split = ".",fixed = TRUE),"[[",2)
df_stringency = data.frame(date = date_stringency,stringency = deu_stringency)
df_stringency = df_stringency %>% mutate(stringency = as.numeric(stringency), date = as.Date(date))



# Ingolstadt
type_of_weather = unique(weatherstack_kelheim$description)
map_vector <- c("Clear","Sunny","Cloudy","Light","Light","Light","Light","Light","Light","Light","Light","Medium","Cloudy","Light","Light","Heavy","Heavy","Heavy","Light","Medium","Heavy","Heavy","Light","Heavy","Heavy","Heavy","Heavy","Heavy","Heavy","Light","Medium","Medium","Light","Heavy","Light","Light","Light","Light","Light","Heavy","Light","Medium","Heavy","Heavy","Heavy")
names(map_vector)<- type_of_weather




ingolstadt_weather = ingolstadt_weather %>% 
  mutate(season = ifelse(month(date) %in% c(12,1,2),"winter",NA)) %>%
  mutate(season = ifelse(month(date) %in% c(3,4,5),"spring",season)) %>%
  mutate(season = ifelse(month(date) %in% c(6,7,8),"summer",season)) %>%
  mutate(season = ifelse(month(date) %in% c(9,10,11),"autumn",season))# %>% dplyr::select(-tsun)




day_description_impact = weatherstack_kelheim_daily %>% pivot_wider(names_from = description,values_from = n)

#remove NAs
day_description_impact[is.na(day_description_impact)] = 0

day_description_impact = day_description_impact %>% pivot_longer(cols = all_of(type_of_weather),names_to = "description",values_to = "value")

day_description_impact = day_description_impact
day_description_impact$description = map_vector[(day_description_impact$description)]

day_description_impact= day_description_impact %>% group_by(date)%>%
  top_n(n = 1,value) %>% group_by(date) %>% top_n(n = 1,description) %>% rename(weather_impact = value)

#####Join the data#####

result_data = demand %>% inner_join(day_description_impact, by = "date") %>% inner_join(ingolstadt_weather,by = "date") %>% inner_join(df_stringency,by = "date") %>% mutate(date = as.Date(date,format = "%Y-%m-%d"))
#Also need to be added weekday
result_data = result_data %>% mutate(wday = as.character(wday(date,week_start = 1)))

#Append holidays
result_data = result_data %>% left_join(df_holidays, by = "date") %>% replace_na(list(isHoliday = FALSE,snow = 0)) #%>% filter(noRides != 0) #%>% filter(date <"2021-07-01")

head(result_data)

From following plots we can see strong correlation between day of the week and number of Rides, because we want to simulate common day using MatSim it was decided to filter weekends out of resulting dataset for the model. Also because day doesn’t represent weather impact, like holidays that have strong impact on mobility so they are also excluded. First 4 days are excluded, because they have 0 rides due to KeXi service start.

wday_plot = ggplotly(ggplot(result_data)+
  geom_boxplot(aes(x = wday,y = noRides)))

holiday_plot = ggplotly(ggplot(result_data)+
  geom_boxplot(aes(x = isHoliday,y = noRides )))

annotations = list( 
  list( 
     x = 0.2,  
    y = 1.0,  
    text = "Weekday",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ),  
  list( 
     x = 0.75,  
    y = 1.0,  
    text = "Is Holiday",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ))

subplot(wday_plot,holiday_plot) %>% layout(annotations = annotations)
result_data = result_data %>% filter(wday!=6 & wday!=7,isHoliday == FALSE, noRides!=0)

After first data processing it would be helpful to find some dependencies in the data using scatter plots mapped to number of rides. Here is summary of end dataset

result_data$description = factor(result_data$description)
result_data$season = factor(result_data$season)
summary(result_data)
##       date               noRides         noRequests    avgEuclidianDistance_m avgTravelTime_s description 
##  Min.   :2020-06-30   Min.   :  1.00   Min.   :  4.0   Min.   : 716.7         Min.   :  0.0   Clear :  6  
##  1st Qu.:2020-10-31   1st Qu.: 63.50   1st Qu.:132.0   1st Qu.:1080.7         1st Qu.:370.2   Cloudy:175  
##  Median :2021-03-15   Median : 82.00   Median :161.0   Median :1225.3         Median :426.6   Heavy : 18  
##  Mean   :2021-04-06   Mean   : 84.14   Mean   :169.9   Mean   :1565.2         Mean   :351.2   Light :125  
##  3rd Qu.:2021-09-14   3rd Qu.:107.00   3rd Qu.:198.5   3rd Qu.:2160.0         3rd Qu.:466.8   Medium:  6  
##  Max.   :2022-01-28   Max.   :150.00   Max.   :408.0   Max.   :4689.6         Max.   :584.6   Sunny : 24  
##  weather_impact      tavg             tmin              tmax             prcp             snow        
##  Min.   : 4.0   Min.   :-9.100   Min.   :-13.900   Min.   :-5.400   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 9.0   1st Qu.: 2.700   1st Qu.: -0.350   1st Qu.: 5.625   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median :12.0   Median : 8.250   Median :  4.300   Median :13.150   Median : 0.000   Median : 0.0000  
##  Mean   :12.9   Mean   : 9.209   Mean   :  5.029   Mean   :13.565   Mean   : 1.627   Mean   : 0.5085  
##  3rd Qu.:17.0   3rd Qu.:16.275   3rd Qu.: 11.600   3rd Qu.:21.175   3rd Qu.: 1.100   3rd Qu.: 0.0000  
##  Max.   :24.0   Max.   :25.100   Max.   : 18.000   Max.   :34.200   Max.   :48.900   Max.   :50.0000  
##       wdir            wspd             wpgt             pres          tsun            season      stringency   
##  Min.   :  7.0   Min.   : 1.400   Min.   :  7.60   Min.   : 985.4   Mode:logical   autumn:123   Min.   :37.04  
##  1st Qu.:107.5   1st Qu.: 5.000   1st Qu.: 20.50   1st Qu.:1014.2   NA's:354       spring: 43   1st Qu.:55.09  
##  Median :230.0   Median : 6.650   Median : 27.70   Median :1018.9                  summer: 91   Median :62.04  
##  Mean   :195.1   Mean   : 7.974   Mean   : 30.72   Mean   :1018.9                  winter: 97   Mean   :63.37  
##  3rd Qu.:258.0   3rd Qu.: 9.625   3rd Qu.: 38.90   3rd Qu.:1024.4                               3rd Qu.:75.00  
##  Max.   :356.0   Max.   :29.200   Max.   :100.10   Max.   :1043.4                               Max.   :85.19  
##      wday           isHoliday      
##  Length:354         Mode :logical  
##  Class :character   FALSE:354      
##  Mode  :character                  
##                                    
##                                    
## 

Finding patterns using graphic approach

tavg_plot = ggplotly(
  ggplot(result_data)+
    geom_point(aes(y= noRides,x = tavg,colour = season))
  )


pres_plot= ggplotly(
  ggplot(result_data)+
    geom_point(aes(y= noRides,x = pres,colour = season))
  )

prcp_plot = ggplotly(
  ggplot(result_data %>% filter(prcp!=0))+
    geom_point(aes(y= noRides,x = prcp,colour = season))
  )
snow_plot = ggplotly(
  ggplot(result_data %>% filter(snow!=0))+
    geom_point(aes(y= noRides,x = snow,colour = season))
  )

subplot(tavg_plot,pres_plot,prcp_plot,snow_plot,nrows = 2)

After looking at these plots we can barely see existing strong dependencies or clusters, but there is one conspicious fact, that only during summer number of rides falls under ~50 while other seasons have more than 50 rides.

Finding correlations

Correlation coefficients between two variables we calculate using cor() function, later anova oneway test will be used for testing to check categorical variable dependency.

best_pred <- result_data %>% ungroup() %>%
  dplyr::select(-noRides,-description ,-date,-season,-noRequests,-avgEuclidianDistance_m,-avgTravelTime_s,-wday) %>%
  map_dbl(cor,y = result_data$noRides) %>%
  #map_dbl(abs) %>%
  sort(decreasing = TRUE) 
print(best_pred)
##           pres           snow           wspd           wdir           prcp           wpgt     stringency 
##     0.18474574     0.10739362     0.03762682     0.00967347    -0.01493169    -0.06206359    -0.12333568 
## weather_impact           tmin           tavg           tmax 
##    -0.21807896    -0.31030789    -0.36408674    -0.37219442

As a result highest absolute score in correlation have temperature(max), snow, pressure(air pressure), stringency(strictness of covid19 policy), weather impact(artificial made variable of highest weather description hours).

Testing of null hypothesis is made for dependent noRides and independent season, description and wday

season_test = oneway.test(result_data$noRides~result_data$season)$p.value
description_test = oneway.test(result_data$noRides~result_data$description)$p.value
wday_test = oneway.test(result_data$noRides~result_data$wday)$p.value
print(c("season" =season_test,"description" =description_test,"wday" =wday_test))
##       season  description         wday 
## 3.323634e-08 3.595709e-05 8.941111e-01

From the p-value we can see that null hypothesis can only be rejected for wday (p.value higher than 0.05), for the description and season we can’t make such conclusion. But one thing that should be also tested is dependency between seasons and temperature - 3.2481382^{-75}. As we can see temperature dependent from season, so in our future model one of the variable should be dropped.

Building a model

After determination of significant variables we can build different models using different set of variables, one thing remains same - first approach is to build linear model to make predictions, also to check which predictors impact number of rides most. In our approach we will firstly larger model using all relevant variables from previous chapter, then we reduce number of variables taking smaller subset and synchronous to check how model quality changes.

data = result_data

omega_model = lm(noRides ~ tavg+pres+stringency+snow+weather_impact*description,data = data)

summary(omega_model)
## 
## Call:
## lm(formula = noRides ~ tavg + pres + stringency + snow + weather_impact * 
##     description, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.680 -17.402  -2.459  17.193  69.934 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        30.0833   196.6141   0.153   0.8785    
## tavg                               -1.4445     0.2392  -6.040 4.06e-09 ***
## pres                                0.2128     0.1880   1.132   0.2584    
## stringency                         -0.6841     0.1376  -4.972 1.06e-06 ***
## snow                                0.3732     0.4395   0.849   0.3964    
## weather_impact                     -9.4743     6.5340  -1.450   0.1480    
## descriptionCloudy                 -96.6944    64.2373  -1.505   0.1332    
## descriptionHeavy                 -105.1883    68.2478  -1.541   0.1242    
## descriptionLight                 -109.3593    64.2148  -1.703   0.0895 .  
## descriptionMedium                -113.1355    72.3294  -1.564   0.1187    
## descriptionSunny                  -92.1232    74.9283  -1.229   0.2197    
## weather_impact:descriptionCloudy    8.6618     6.5467   1.323   0.1867    
## weather_impact:descriptionHeavy    10.1402     6.9605   1.457   0.1461    
## weather_impact:descriptionLight     9.8958     6.5642   1.508   0.1326    
## weather_impact:descriptionMedium   10.5065     7.6842   1.367   0.1724    
## weather_impact:descriptionSunny     9.3651     7.4120   1.264   0.2073    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.78 on 338 degrees of freedom
## Multiple R-squared:  0.2574, Adjusted R-squared:  0.2245 
## F-statistic: 7.811 on 15 and 338 DF,  p-value: 4.418e-15

Low adjusted R^2 doesn’t tells that our model doesn’t fit the data. Let’s take a look on predicted values and residuals.

colors = c("actual" = "blue","predicted" = "red","residuals" = "gray50","zerorides" = "purple")
model = omega_model
test_data = data %>% add_predictions(model = model) %>% add_residuals(model = model) %>% mutate(error = ifelse(abs(resid)>=20,"extreme","normal"))


ggplotly(ggplot(test_data %>% filter(year(date)>=2020)) +
  geom_line(aes(x = date,y = noRides,color = "actual"))+
  geom_line(aes(x = date,y = pred,color = "predicted"))+
  geom_line(aes(x = date,y = resid,color = "residuals"))+
  geom_ref_line(h = 0)+
  scale_color_manual(values = colors))

As we can see from our residuals plot our data has continuous growing trend, that our model doesn’t consider, so taking a date variable to model can drastically improve model “quality”.

omega_date_model = lm(noRides ~ tavg+pres+stringency+snow+weather_impact*description+date+season,data = data)

summary(omega_date_model)
## 
## Call:
## lm(formula = noRides ~ tavg + pres + stringency + snow + weather_impact * 
##     description + date + season, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -81.141  -8.559  -0.059  10.232  39.732 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -2.340e+03  1.532e+02 -15.271  < 2e-16 ***
## tavg                             -2.652e-01  2.220e-01  -1.195    0.233    
## pres                             -1.044e-01  1.141e-01  -0.915    0.361    
## stringency                       -6.280e-02  1.202e-01  -0.522    0.602    
## snow                              1.964e-01  2.635e-01   0.746    0.456    
## weather_impact                   -2.291e+00  3.924e+00  -0.584    0.560    
## descriptionCloudy                -1.913e+01  3.859e+01  -0.496    0.621    
## descriptionHeavy                 -2.759e+01  4.097e+01  -0.674    0.501    
## descriptionLight                 -2.139e+01  3.858e+01  -0.554    0.580    
## descriptionMedium                -2.452e+01  4.342e+01  -0.565    0.573    
## descriptionSunny                 -3.205e+01  4.500e+01  -0.712    0.477    
## date                              1.372e-01  5.631e-03  24.364  < 2e-16 ***
## seasonspring                     -6.217e+00  4.052e+00  -1.534    0.126    
## seasonsummer                     -1.223e+01  3.058e+00  -4.001 7.77e-05 ***
## seasonwinter                     -5.933e+00  3.622e+00  -1.638    0.102    
## weather_impact:descriptionCloudy  1.852e+00  3.933e+00   0.471    0.638    
## weather_impact:descriptionHeavy   3.100e+00  4.175e+00   0.742    0.458    
## weather_impact:descriptionLight   1.615e+00  3.944e+00   0.409    0.682    
## weather_impact:descriptionMedium  1.830e+00  4.617e+00   0.396    0.692    
## weather_impact:descriptionSunny   2.684e+00  4.456e+00   0.602    0.547    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.99 on 334 degrees of freedom
## Multiple R-squared:  0.7384, Adjusted R-squared:  0.7235 
## F-statistic: 49.61 on 19 and 334 DF,  p-value: < 2.2e-16
colors = c("actual" = "blue","predicted" = "red","residuals" = "gray50","zerorides" = "purple")
model = omega_date_model
test_data = data %>% add_predictions(model = model) %>% add_residuals(model = model) %>% mutate(error = ifelse(abs(resid)>=20,"extreme","normal"))


ggplotly(ggplot(test_data %>% filter(year(date)>=2020)) +
  geom_line(aes(x = date,y = noRides,color = "actual"))+
  geom_line(aes(x = date,y = pred,color = "predicted"))+
  geom_line(aes(x = date,y = resid,color = "residuals"))+
  geom_ref_line(h = 0)+
  scale_color_manual(values = colors))

As expected R^2 squared increased as well residual standard error is decreased to a number of 15,99. To check correctness of used approach residuals have to be normally distributed.

Residuals histogram

barplot <- ggplot(test_data, aes(x = resid ))+
  geom_histogram(aes(y = stat(density)),colour="black", fill="white", binwidth=7)+
  geom_density( fill="#FF6666",adjust = 10,alpha = 0.5) +
  ggtitle("Omega model residuals")

ggplotly(barplot)
normal_dist = fitdistrplus::fitdist(test_data$resid,"norm")
plot(normal_dist)

From described plots we can see that our residuals looks normally distributed with skewness to left tail, with some outliers. But simple Shapiro-Wilk test gives p-value of 1.2434399^{-8} out, so our residuals aren’t

Reducing a model

Let’s take some variables out and see how model performs on the data and F Statistics

As we can see excluding variables from model doesn’t make it worse, we can also use anova test function from stats package to check model varable significance.

Excluding more variables from a model doesn’t impove model as well increases residual error enormous, so we can make conclusion, that represented model is minimal fitting possible with variables that have most impact to number of rides by the weather. Remains only to check residuals of fitted model.